Session 8 Practice: Publishing in R markdown

The script below is a complete data analysis of the 2016 US presidential election. Your job is to (i) understand what the code is doing, and (ii) convert the script into an R markdown document.
# County-level Data for 2016 Presidential Elections

# This is an analysis of US presidential elections data for 2016 at the county level. 
# Since only a small percentage of votes went to independent candidates, we will only 
# compare Democrat and Republican voteshare.
# The data for this analysis is taken from 
# https://github.com/tonmcg/County_Level_Election_Results_12-16.

#####
# DATA IMPORT AND CHECKING
#####
library(tidyverse)
library(knitr)

df <- read_csv("http://web.stanford.edu/~kjytay/courses/stats32-aut2019/Session%208/2016_US_County_Level_Presidential_Results.csv")
head(df)

# check no. of rows: it matches the no. of counties in the US
# (Sources: http://www.snopes.com/trump-won-3084-of-3141-counties-clinton-won-57/ and http://www.wnd.com/2016/12/trumps-landslide-2623-to-489-among-u-s-counties/)
nrow(df)

# dataset columns
# - `per_dem` and `per_gop` refer to the percentage of votes going to Democrats 
# and Republicans respectively.
# - `diff` represents the absolute difference between the Republican and Democrat vote counts.
# - `per_point_diff` represents this difference as a percentage of total votes.
# - `combined_fips` is a 5-digit code identifying the county.
# (From https://en.wikipedia.org/wiki/FIPS_county_code: The FIPS county code is 
# a five-digit Federal Information Processing Standards (FIPS) code (FIPS 6-4) 
# which uniquely identifies counties and county equivalents in the United States, 
# certain U.S. possessions, and certain freely associated states.)
names(df)
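
# A quick check of the description above (a sketch, not part of the original
# analysis): if `combined_fips` uniquely identifies each county in this
# dataset, no code should appear more than once. `n_repeated_fips` is a name
# introduced here for illustration.
n_repeated_fips <- sum(duplicated(df$combined_fips))
n_repeated_fips  # 0 would mean every row has a distinct code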

# Since we are interested in whether a given county had more Republican or 
# Democrat votes, we have to recompute the `diff` and `per_point_diff` columns.
# `diff` and `per_point_diff` will be positive if there are more Republican votes 
# than Democrat votes (and vice versa).
df <- df %>% mutate(diff = votes_gop - votes_dem,
                    per_point_diff = diff / total_votes * 100)
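
# Sanity check on the recomputed columns (a sketch, not part of the original
# analysis). Assuming `total_votes` is at least `votes_gop + votes_dem` in
# every row, `per_point_diff` should lie between -100 and 100, and it should
# always have the same sign as `diff`. `margin_range` and `signs_agree` are
# names introduced here for illustration.
margin_range <- range(df$per_point_diff)
margin_range
signs_agree <- all(sign(df$diff) == sign(df$per_point_diff))
signs_agree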

#####
# SUMMARY STATISTICS
#####
# percentage of popular vote won by each party: Clinton actually wins!
paste0("Republican % of popular vote: ", 
       round(sum(df$votes_gop) / sum(df$total_votes) * 100, digits = 1),
       "%")
paste0("Democrat % of popular vote: ", 
       round(sum(df$votes_dem) / sum(df$total_votes) * 100, digits = 1),
       "%")

# number of counties won by each party: Trump wins by a lot
# How can this be, given that Clinton won the popular vote? Two hypotheses:
# hypothesis 1: Margin of victory was slimmer in the counties that Trump won 
# compared with the counties that Clinton won.
# hypothesis 2: Clinton won in counties with large populations.
# (a county with exactly tied votes would be counted for neither party)
df %>% summarize(gop_won = sum(votes_gop > votes_dem),
                 dem_won = sum(votes_dem > votes_gop))

#####
# HISTOGRAMS
#####
# Test hypothesis 1: histogram of the `per_point_diff`
# Not true that Trump had narrower margins of victory
ggplot() +
    geom_histogram(data = df, mapping = aes(x = per_point_diff)) + 
    labs(title = "Histogram of % vote margin", 
         x = "% Republicans won by", y = "Frequency")
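
# A numerical companion to the histogram (a sketch, not part of the original
# analysis): the median size of the % margin in the counties won by each
# party. If hypothesis 1 were true, the Republican median would be the smaller
# of the two. `winner`, `counties`, `median_margin` and `margin_by_winner` are
# names introduced here for illustration; tied counties are dropped.
margin_by_winner <- df %>%
    filter(votes_gop != votes_dem) %>%
    mutate(winner = if_else(votes_gop > votes_dem, "Republican", "Democrat")) %>%
    group_by(winner) %>%
    summarize(counties = n(),
              median_margin = median(abs(per_point_diff)))
margin_by_winner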

# Test hypothesis 2: histogram of `diff`
# Completely different picture! Note the x-axis scale
ggplot() +
    geom_histogram(data = df, mapping = aes(x = diff)) + 
    labs(title = "Histogram of absolute vote margin", 
         x = "No. of votes Republicans won by", y = "Frequency")
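
# A numerical companion for hypothesis 2 (a sketch, not part of the original
# analysis). The columns described above do not include population, so this
# ASSUMES that `total_votes` is a reasonable stand-in for county size.
# `winner`, `median_total_votes` and `size_by_winner` are names introduced
# here for illustration; tied counties are dropped.
size_by_winner <- df %>%
    filter(votes_gop != votes_dem) %>%
    mutate(winner = if_else(votes_gop > votes_dem, "Republican", "Democrat")) %>%
    group_by(winner) %>%
    summarize(median_total_votes = median(total_votes))
size_by_winner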

# show top 50 counties with largest absolute vote difference
# top 45 counties with largest absolute vote difference were all won by Clinton 
# number 46 was Montgomery, TX, which went to Trump
df %>% select(State = state_abbr, County = county_name, diff) %>%
    mutate(abs_diff = abs(diff)) %>%
    arrange(desc(abs_diff)) %>%
    select(State, County, `Vote difference` = diff) %>%
    head(n = 50) %>%
    kable()

#####
# CONCLUSION
#####
# When analyzing elections, we have to examine the data from many different 
# perspectives in order to get the full story.

Kenneth Tay

Oct 17, 2019